Overview of Dataset

The dataset was obtained from Kaggle. It has 299 observations and 13 variables. The outcome variable ‘DEATH_EVENT’ indicates whether or not a patient died of heart failure, and 11 of the remaining variables serve as predictors. The variable names are shown below:

NB: The 12th variable, ‘time’, records the time from the start of the study after which observation of the subject ended. Presumably, this could be because the subject was declared healthy, dropped out of the study for various reasons, or died from heart failure. That time would not be available in real-world use, when the resulting model is predicting the outcome of a new case. To avoid target leakage, the ‘time’ variable is therefore not used as a feature to train the model.

## Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
##        'ejection_fraction', 'high_blood_pressure', 'platelets',
##        'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
##        'DEATH_EVENT'],
##       dtype='object')
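
As a minimal sketch of how the ‘time’ column could be excluded before training, the snippet below builds a tiny made-up DataFrame (the values and the names `df`, `X` and `y` are illustrative assumptions, not the actual Kaggle data or the original code) and drops both the target and the leaky column from the feature matrix.

```python
import pandas as pd

# Toy stand-in for the Kaggle data; the values are made up for illustration.
df = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0],
    "ejection_fraction": [20, 38, 20, 20],
    "time": [4, 6, 7, 7],
    "DEATH_EVENT": [1, 1, 1, 0],
})

# Keep the target separately, then drop the target and the leaky 'time'
# column so that neither can be used as a feature.
y = df["DEATH_EVENT"]
X = df.drop(columns=["time", "DEATH_EVENT"])

print(list(X.columns))
```

Dropping the column up front, rather than relying on remembering to skip it later, keeps it out of every downstream step.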

Brief Exploratory Data Analysis

We can take a quick look at the only two demographic variables in the dataset: age and sex. From the output below, the age of the respondents ranges from 40 to 95 years, with a median of 60 years and a mean of approximately 61 years.

## count    299.000000
## mean      60.833893
## std       11.894809
## min       40.000000
## 25%       51.000000
## 50%       60.000000
## 75%       70.000000
## max       95.000000
## Name: age, dtype: float64

Pre-selection of Features and Feature Engineering

We are closer to our goal of comparing the performance of various ML models on the dataset. Features here are pre-selected based on domain knowledge. Next comes some feature engineering with the Synthetic Minority Oversampling Technique (SMOTE). In our dataset, the proportion of “No” examples for the outcome variable is much higher than that of “Yes” examples. There is thus a danger of our ML algorithms being biased if trained on this data, as they would have far more “No” examples to learn from. SMOTE uses a k-nearest-neighbour algorithm to generate synthetic minority-class examples, which helps to overcome the overfitting problem posed by random oversampling. I chose SMOTE over random undersampling of the majority class because I want to preserve the data and not eliminate any examples, since I do not have much training data to begin with.
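
To make the interpolation idea behind SMOTE concrete, here is a simplified NumPy sketch. It is not the code used for this analysis, and it is not the implementation in the `imbalanced-learn` package (whose `SMOTE` class would be the usual choice in practice). The function name `smote_sketch`, its parameters and the toy data are all assumptions for illustration, and the function expects at least two minority samples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours. Assumes X_min has at least two rows."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                       # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Made-up minority-class ("Yes") examples with two features.
X_yes = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [2.5, 2.5]])
synthetic = smote_sketch(X_yes, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic point lies on a line segment between two real minority examples, the new rows are plausible variations rather than exact copies, which is what distinguishes this approach from random oversampling.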

Feature Selection

First, we identify features with low variance and de-select them, since they would not help the model much in finding patterns. We will also check for multicollinearity among the features and de-select one feature from each highly correlated pair.

VarianceThreshold(threshold=0.05)
## array([ True,  True,  True,  True,  True,  True,  True,  True])

From the results, all eight features pass our variance threshold of 0.05, so none are de-selected at this step.
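
The low-variance check can be sketched with scikit-learn's `VarianceThreshold`, using the same threshold of 0.05. The array below is made-up data (not the heart failure dataset), chosen so that one column is nearly constant and gets filtered out.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up data: the third column is nearly constant.
X = np.array([
    [40.0, 0.0, 1.0],
    [60.0, 1.0, 1.0],
    [75.0, 0.0, 1.0],
    [95.0, 1.0, 1.1],
])

selector = VarianceThreshold(threshold=0.05)
selector.fit(X)

# Boolean mask: True for columns whose variance exceeds the threshold.
mask = selector.get_support()
print(mask)

# Keep only the columns that passed the check.
X_kept = X[:, mask]
```

Note that variance depends on the scale of each feature, so a fixed threshold treats a binary flag and a lab measurement in large units very differently.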

##                                age   anaemia  ...  high_blood_pressure   smoking
## age                       1.000000  0.088006  ...             0.093289  0.018668
## anaemia                   0.088006  1.000000  ...             0.038182 -0.107290
## creatinine_phosphokinase -0.081584 -0.190741  ...            -0.070590  0.002421
## diabetes                 -0.101012 -0.012729  ...            -0.012732 -0.147173
## ejection_fraction         0.060098  0.031557  ...             0.024445 -0.067315
## serum_sodium             -0.045966  0.041882  ...             0.037109  0.004813
## high_blood_pressure       0.093289  0.038182  ...             1.000000 -0.055711
## smoking                   0.018668 -0.107290  ...            -0.055711  1.000000
## 
## [8 rows x 8 columns]
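
One way to act on a correlation matrix like this is to scan its upper triangle and drop one feature from every pair whose absolute correlation exceeds a cutoff. The sketch below uses made-up data, and the cutoff of 0.8 is a hypothetical value for illustration only, since no cutoff is stated here.

```python
import numpy as np
import pandas as pd

# Made-up data: 'b' is almost a multiple of 'a'; 'c' is only weakly related.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
cutoff = 0.8  # hypothetical cutoff, not taken from the analysis

corr = df.corr().abs()

# Keep only the upper triangle so each pair is inspected exactly once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of any pair that exceeds the cutoff.
to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)
```

Which member of a correlated pair to drop is a judgment call; this sketch simply drops the column that appears later in the DataFrame.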